Take-home_Ex03

Published

March 12, 2023

Modified

March 14, 2023

Setting the Scene

Housing is an essential component of household wealth worldwide. Buying a housing has always been a major investment for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.

Objectives

  • Predict the HDB resale prices (4-room) for the month of January and February 2023 using data from 1st January 2021 to 31st December 2022 as the training dataset.

  • Compare the results of the prediction by the conventional OLS method and the Geographically Weighted method.

1. Setup

Referencing Senior’s work

1.1 Datasets

  • Aspatial Data

    • HDB resale prices in Singapore from January 2021 to February 2023. It is in csv format and can be downloaded from Data.gov.sg.
  • Geospatial Data

    • 2019 Master Plan Planning Subzone
  • Locational factors with geographic coordinates

    • Childcare data in geojson format.

    • Eldercare data in shapefile format.

    • Hawker centre data in geojson format.

    • Parks data in geojson format.

    • MRT stations data in shapefile format.

    • Supermarkets data in geojson format.

  • Locational factors without geographic coordinates

    • CBD Coordinates to be scraped and obtained from Google

    • Good primary schools scraped from Local Salary Forum

1.2 Install and Load Packages

pacman::p_load(dplyr, rvest, olsrr, ggpubr, sf, spdep, GWmodel, tmap, gtsummary, readr, corrplot, tidyverse, httr, jsonlite, broom)

1.3 Loading data

resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

1.3.1 Filtering Data

We will filter our data to focus on:

  • 4-Room HDB flat sub-market.

  • prices from 1st January 2021 to 31st December 2022

resale_focused <- filter(resale, flat_type == "4 ROOM") %>%
  filter(month >= "2021-01" & month <= "2022-12")

2. Data Wrangling

2.1 Transforming resale data

We require the Area of the unit, Floor level, Remaining lease (months), Age of the unit.

  • The data is currently in Years and Months. We will have to standardise it so that calculation will be easier.

  • Our data also does not have coordinates. We will hence have to concatenate the block number and street name, then retrieve the coordinates from OneMap.

  • To get the age of unit, we need to subtract lease_commencement_date from month

rs_transform <- resale_focused %>% 
  mutate(resale_focused, address = paste(block, street_name)) %>%
  mutate(resale_focused, 
         remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
  mutate(resale_focused,
         remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11))) %>%
  mutate(resale_focused, 
         age_of_unit = as.integer(str_sub(month, 0, 4)) - as.integer(lease_commence_date))

Now, we have to:

  • Convert number of years to months by multiplying by 12 for age of unit and remaining lease year.

  • Convert all the NAs in remaining_lease_mth into 0s so that we can add the two columns together.

  • Store this in a new column called remaining_lease_mths.

rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform$age_of_unit <- rs_transform$age_of_unit * 12
rs_transform <- rs_transform %>% 
  mutate(rs_transform, remaining_lease_mths = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mth")])) %>%
  select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model, age_of_unit,
         lease_commence_date, remaining_lease_mths, resale_price)

2.2 Retrieve Postal Codes and Coordinates of Addresses

2.2.1 Create a list storing unique addresses

  • We create a list to store unique addresses to ensure that we do not run the GET request more than what is necessary

  • We can also sort it to make it easier for us to see at which address the GET request will fail.

  • Here, we use unique() function of base R package to extract the unique addresses then use sort() function of base R package to sort the unique vector.

address_list <- sort(unique(rs_transform$address))

2.2.2 Creating a function to retrieve coordinates from OneMap

1.  Firstly, we create a dataframe called `postal_coords` to store all the final retrieved coordinates


2.  Secondly, we first use *GET()* function of httr package to make a GET request to [*https://developers.onemap.sg/commonapi/search*](https://developers.onemap.sg/commonapi/search)

-   OneMap SG offers functions for us to query spatial data from the API in a tidy format and provides additional functionalities to allow easy data manipulation.**

-   Here, we will be using their REST APIs to search address data for a given search value and retrieve the coordinates of the searched location.**

-   The required variables to be included in the GET request is as follows:

    -   **`searchVal`**: Keywords entered by user that is used to filter out the results.

    -   **`returnGeom`** {Y/N}: Checks if user wants to return the geometry.

    -   **`getAddrDetails`** {Y/N}: Checks if user wants to return address details for a point.

-   **Note**:

    -   The JSON response returned will contain multiple fields.

    -   However, we are only interested in the postal code and coordinates like Latitude & Longitude.

    -   On their website, they also made an announcement on a minor text fix where they changed the word \"LONGTITUDE\" to \"LONGITUDE\" which we will be using the latter in this analysis.


3.  We then create a dataframe `new_row` which will be used to store each final set of coordinates retrieved during the loop


4.  We also need to check the number of responses returned and append to the main dataframe accordingly. This is because:

-   The no. of returned responses of the searched location, (indicated by variable `found`) , varies as some location might have only a single result while other locations might return multiple results.

    -   For example, the address 2 JLN BATU returns 3 sets of postal codes and coordinates ( meaning `found` = 3).

    -   Hence, what we can do is to first look at only those that does not have empty postal codes then take the first set/row of the coordinates

-   We can also check to see if the address is invalid by looking at the number of rows returned by request.

-   There will also be some addresses searched that are invalid. ( means `found` = 0)

-   This step was helpful in determining what was causing the error of the API Call. We will see in the later section what errors was caused by the invalid searched errors.


5.  Lastly, we will append the returned response (`new_row`) with the necessary fields to the main dataframe (`postal_coords`) using *rbind()* function of base R package.
get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://developers.onemap.sg/commonapi/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

2.2.3 Retrieve resale coordinates

coords <- get_coords(address_list)

Check if there are any NA/NIL values

coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]

Looking for 215 Choa Chu Kang Central on OneMap returns us with “Blk 215 and 216 Choa Chu Kang Central” with two different postal codes. This could be the reason why a postal code was not assigned. However, since we have the coordinates, we can proceed.

2.2.4 Combine resale and coordinates data

rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))
head(rs_coords)
rs_coords_final <- rs_coords %>% select(c("latitude", "longitude"))

Write file to rds

rs_coords_rds <- write_rds(rs_coords, "data/aspatial/rds/rs_coords.rds")

2.3 Transform CRS and check

Read RDS

rs_coords <- read_rds("data/aspatial/rds/rs_coords.rds")

Since the coordinates are in decimal degrees, the projected CRS will be WSG84.

We will hence need to assign it as CRS 4326 before transforming it to 3414 which is the EPSG code for SVY21.

  • convert data frame into sf object

  • transform the coordinates of the sf object

rs_coords_sf <- st_as_sf(rs_coords, 
                         coords = c("longitude", 
                                    "latitude"),
                         crs = 4326) %>%
  st_transform(crs = 3414)

Check EPSG

st_crs(rs_coords_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

2.3.1 Check for invalid geometries

length(which(st_is_valid(rs_coords_sf) == FALSE))
[1] 0
tmap_mode("view")
tm_shape(rs_coords_sf)+
  tm_dots(col="blue", size = 0.02)
tmap_mode("plot")

3. Loading Locational Factors

3.1 Locational Factors with Coordinates

  • Childcare data in geojson format.

  • Eldercare data in shapefile format.

  • Hawker centre data in geojson format.

  • Parks data in kml format.

  • MRT stations data in shapefile format.

  • Supermarkets data in geojson format.

elder_sf <- st_read(dsn = "data/geospatial", layer = "ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\zoe-chia\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
mrt_sf <- st_read(dsn = "data/geospatial", layer = "Train_station_Exit_Layer")
Reading layer `Train_station_Exit_Layer' from data source 
  `C:\zoe-chia\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 562 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
childcare_sf<- st_read("data/geospatial/child-care-services-geojson.geojson")
Reading layer `child-care-services-geojson' from data source 
  `C:\zoe-chia\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial\child-care-services-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1545 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6824 ymin: 1.248403 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
hawker_sf <- st_read("data/geospatial/hawker-centres-geojson.geojson")
Reading layer `hawker-centres-geojson' from data source 
  `C:\zoe-chia\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial\hawker-centres-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449217
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
supermarkets_sf <- st_read("data/geospatial/supermarkets-geojson.geojson")
Reading layer `supermarkets-geojson' from data source 
  `C:\zoe-chia\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial\supermarkets-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
parks_sf <- st_read("data/geospatial/parks.kml")
Reading layer `NATIONALPARKS_New' from data source 
  `C:\zoe-chia\IS415\Take-home_Ex\Take-home_Ex03\data\geospatial\parks.kml' 
  using driver `KML'
Simple feature collection with 421 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
st_crs(elder_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(mrt_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(childcare_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(hawker_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(supermarkets_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(parks_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
  • Train_station_Exit_Layer and ELDERCARE have their projected CRS as SVY21, which is EPSG 9001.

  • Childcare, hawker centres, supermarkets, national parks have their projected CRS as WGS84, which is EPSG 4326.

  • Hence, we will need to convert the EPSG to 3414 for these datasets.

elder_sf <- st_set_crs(elder_sf, 3414)
mrt_sf <- st_set_crs(mrt_sf, 3414)


hawker_sf <- hawker_sf %>%
  st_transform(crs = 3414)
parks_sf <- parks_sf %>%
  st_transform(crs = 3414)
supermarkets_sf <- supermarkets_sf %>%
  st_transform(crs = 3414)
childcare_sf <- childcare_sf %>%
  st_transform(crs = 3414)

3.1.1 Check for invalid geometries

length(which(st_is_valid(elder_sf) == FALSE))
[1] 0
length(which(st_is_valid(mrt_sf) == FALSE))
[1] 0
length(which(st_is_valid(hawker_sf) == FALSE))
[1] 0
length(which(st_is_valid(parks_sf) == FALSE))
[1] 0
length(which(st_is_valid(supermarkets_sf) == FALSE))
[1] 0
length(which(st_is_valid(childcare_sf) == FALSE))
[1] 0

3.2 Proximity function

3.2.1 Create get proximity function

The following code chunk performs 3 steps:

  • It will create a matrix of distances between the HDB and the locational factor using st_distance of sf package.

  • It will also get the nearest point of the locational factor by looking at the minimum distance using min function of base R package then add it to HDB resale data under a new column using mutate() function of dpylr package. (Find the nearest location to that HDB, calculate proximity and add to column)

  • Lastly, it will rename the column name according to input given by user so that the columns have appropriate and distinct names that are different from one another.

get_prox <- function(origin_df, dest_df, col_name){
  
  # creates a matrix of distances
  dist_matrix <- st_distance(origin_df, dest_df)           
  
  # find the nearest location_factor and create new data frame
  near <- origin_df %>% 
    mutate(PROX = apply(dist_matrix, 1, function(x) min(x)) / 1000) 
  
  # rename column name according to input parameter
  names(near)[names(near) == 'PROX'] <- col_name

  # Return df
  return(near)
}

3.2.2 Find proximity

rs_coords_sf <- get_prox(rs_coords_sf, elder_sf, "PROX_ELDERLYCARE") 
rs_coords_sf <- get_prox(rs_coords_sf, mrt_sf, "PROX_MRT") 
rs_coords_sf <- get_prox(rs_coords_sf, hawker_sf, "PROX_HAWKER") 
rs_coords_sf <- get_prox(rs_coords_sf, parks_sf, "PROX_PARK") 
rs_coords_sf <- get_prox(rs_coords_sf, supermarkets_sf, "PROX_SUPERMARKET")